Abstract

Motivation: Modern data acquisition based on high-throughput technology is often facing the 
problem of missing data. Algorithms commonly used in the analysis of such large-scale data often 
depend on a complete set. Missing value imputation offers a solution to this problem. However, 
the majority of available imputation methods are restricted to one type of variable only: continuous 
or categorical. For mixed-type data the different types are usually handled separately. Therefore, 
these methods ignore possible relations between variable types. We propose a nonparametric method 
which can cope with different types of variables simultaneously. 

Results: We compare several state-of-the-art methods for the imputation of missing values. We
propose and evaluate an iterative imputation method (missForest) based on a random forest. By
averaging over many unpruned classification or regression trees, random forest intrinsically constitutes
a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are
able to estimate the imputation error without the need for a test set. Evaluation is performed on
multiple data sets coming from a diverse selection of biological fields with artificially introduced
missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values,
particularly in data sets including different types of variables. In our comparative study missForest
outperforms other methods of imputation, especially in data settings where complex interactions and
nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to
be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and
can cope with high-dimensional data.



Availability: The R package missForest is freely available from http://stat.ethz.ch/CRAN/.

Pre-print version: This article was submitted to Oxford Journals' Bioinformatics on 3 May 2011.
Version 2 was resubmitted on 27 September 2011.



1 Introduction 

Imputation of missing values is often a crucial step in data analysis. Many established methods of 
analysis require fully observed data sets without any missing values. However, this is seldom the case 
in medical and biological research today. The ongoing development of new and enhanced measurement 
techniques in these fields provides data analysts with challenges prompted not only by high-dimensional 
multivariate data where the number of variables may greatly exceed the number of observations but also 
by mixed data types where continuous and categorical variables are present. In our context categorical 
variables can arise in many forms, ranging from technical settings in a mass spectrometer to a diagnostic
expert opinion on a disease state. Additionally, such data sets often contain complex interactions and
nonlinear relation structures which are notoriously hard to capture with parametric procedures.

Most prevalent imputation methods, like k-nearest neighbours (KNNimpute, Troyanskaya et al. (2001))
for continuous data, the saturated multinomial model (Schafer (1997)) for categorical data and
multivariate imputation by chained equations (MICE, Van Buuren and Oudshoorn (1999)) for mixed data
types, depend on tuning parameters or on the specification of a parametric model. The choice of such
tuning parameters or models without prior knowledge is difficult and might have a dramatic effect on a
method's performance. Excluding MICE, the above methods and the majority of other imputation methods
are restricted to one type of variable. Furthermore, all these methods make assumptions about the
distribution of the data or subsets of the variables, leading to questionable situations, e.g.,
assuming normal distributions.

The literature on mixed-type data imputation is rather scarce. Its first appearance was in the
developing field of multiple imputation brought up by Rubin (1978). Little and Schluchter (1985)
presented an approach based on maximum likelihood estimation combining the multivariate normal model
for continuous and the Poisson/multinomial model for categorical data. This idea was later extended in
the book by Little and Rubin (1987). See also Li (1988), Rubin and Schafer (1990) and Schafer (1997).
A more refined method to combine different regression models for mixed-type data was proposed by
Van Buuren and Oudshoorn (1999) using chained equations. The conditional model in MICE can be
specified for the missing data in each incomplete variable. Therefore no multivariate model covering
the entire data set has to be specified. However, it is assumed that such a full multivariate distribution
exists and missing values are sampled from conditional distributions based on this full distribution (for
more details see Section 3). Another similar method using variable-wise conditional distributions,
called sequential regression multivariate imputation, was proposed by Raghunathan et al. (2001). Unlike in
MICE the predictors must not be incomplete. The method is focussed on survey data and therefore
includes strategies to incorporate restrictions on subsamples of individuals and logical bounds based on
domain knowledge about the variables, e.g., only women can have a number of pregnancies recorded.
Our motivation is to introduce a method of imputation which can handle any type of input data and
makes as few assumptions as possible about structural aspects of the data. Random forest (RF, Breiman
(2001)) is able to deal with mixed-type data and as a nonparametric method it allows for interactive
and nonlinear (regression) effects. We address the missing data problem using an iterative imputation
scheme by training an RF on observed values in a first step, followed by predicting the missing values and
then proceeding iteratively. Mazumder et al. (2010) use a similar approach for the matrix completion
problem using a soft-thresholded SVD iteratively replacing the missing values. We choose RF because
it can handle mixed-type data and is known to perform very well under unfavourable conditions like high
dimensions, complex interactions and nonlinear data structures. Due to its accuracy and robustness RF
is well suited for use in applied research, which often harbours such conditions. Furthermore, the RF
algorithm allows for estimating out-of-bag (OOB) error rates without the need for a test set. For further
details see Breiman (2001).



Here we compare our method with k-nearest neighbour imputation (KNNimpute, Troyanskaya et al.
(2001)) and the Missingness Pattern Alternating Lasso (MissPALasso) algorithm by Städler and Bühlmann
(2010) on data sets having continuous variables only. For the cases of categorical and mixed type of
variables we compare our method with the MICE algorithm by Van Buuren and Oudshoorn (1999) and a
dummy variable encoded KNNimpute. Comparisons are performed on several data sets coming from
different fields of life sciences and using different proportions of missing values.

We show that our approach is competitive with or outperforms the compared methods on the used data
sets irrespective of the variable type composition, the data dimensionality, the source of the data or the
amount of missing values. In some cases the decrease of imputation error is up to 50%. This performance
is typically reached within only a few iterations which makes our method also computationally attractive.
The OOB imputation error estimates give a very good approximation of the true imputation error, having
on average a proportional deviation of no more than 10-15%. Furthermore, our approach needs no
tuning parameter and hence is easy to use and needs no prior knowledge about the data.



2 Approach 

We assume $X = (X_1, X_2, \ldots, X_p)$ to be an $n \times p$-dimensional data matrix. We propose using a
random forest to impute the missing values due to its earlier mentioned advantages as a regression method.
The random forest algorithm has a built-in routine to handle missing values by weighting the frequency of
the observed values in a variable with the random forest proximities after being trained on the initially
mean imputed data set (Breiman (2001)). However, this approach requires a complete response variable
for training the forest.

Instead, we directly predict the missing values using a random forest trained on the observed parts of
the data set. For an arbitrary variable $X_s$ including missing values at entries
$i_{\mathrm{mis}}^{(s)} \subseteq \{1, \ldots, n\}$ we can separate the data set into four parts:

1. The observed values of variable $X_s$, denoted by $y_{\mathrm{obs}}^{(s)}$;

2. the missing values of variable $X_s$, denoted by $y_{\mathrm{mis}}^{(s)}$;

3. the variables other than $X_s$ with observations $i_{\mathrm{obs}}^{(s)} = \{1, \ldots, n\} \setminus i_{\mathrm{mis}}^{(s)}$, denoted by $x_{\mathrm{obs}}^{(s)}$;

4. the variables other than $X_s$ with observations $i_{\mathrm{mis}}^{(s)}$, denoted by $x_{\mathrm{mis}}^{(s)}$.

Note that $x_{\mathrm{obs}}^{(s)}$ is typically not completely observed since the index $i_{\mathrm{obs}}^{(s)}$
corresponds to the observed values of the variable $X_s$. Likewise, $x_{\mathrm{mis}}^{(s)}$ is typically not
completely missing.

To begin, make an initial guess for the missing values in $X$ using mean imputation or another
imputation method. Then, sort the variables $X_s$, $s = 1, \ldots, p$, according to the amount of missing
values starting with the lowest amount. For each variable $X_s$ the missing values are imputed by first
fitting a random forest with response $y_{\mathrm{obs}}^{(s)}$ and predictors $x_{\mathrm{obs}}^{(s)}$; then,
predicting the missing values $y_{\mathrm{mis}}^{(s)}$ by applying the trained random forest to
$x_{\mathrm{mis}}^{(s)}$. The imputation procedure is repeated until a stopping criterion
is met. The pseudo-code in Algorithm 1 gives a representation of the missForest method.
Algorithm 1 Impute missing values with random forest.

Require: $X$ an $n \times p$ matrix, stopping criterion $\gamma$
 1: Make initial guess for missing values;
 2: $k \leftarrow$ vector of sorted indices of columns in $X$
    w.r.t. increasing amount of missing values;
 3: while not $\gamma$ do
 4:     $X_{\mathrm{old}}^{\mathrm{imp}} \leftarrow$ store previously imputed matrix;
 5:     for $s$ in $k$ do
 6:         Fit a random forest: $y_{\mathrm{obs}}^{(s)} \sim x_{\mathrm{obs}}^{(s)}$;
 7:         Predict $y_{\mathrm{mis}}^{(s)}$ using $x_{\mathrm{mis}}^{(s)}$;
 8:         $X_{\mathrm{new}}^{\mathrm{imp}} \leftarrow$ update imputed matrix, using predicted $y_{\mathrm{mis}}^{(s)}$;
 9:     end for
10:     update $\gamma$.
11: end while
12: return the imputed matrix $X^{\mathrm{imp}}$
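The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the missForest implementation: it handles continuous variables only, marks missing entries with `None`, and, to stay dependency-free, swaps in a k-nearest-neighbour averaging learner (`knn_predict`, a hypothetical stand-in) where the method proper fits a random forest. The function names, the `max_iter` safeguard and the choice to return the previous imputation once the difference increases are assumptions of this sketch.

```python
import math

def knn_predict(x_train, y_train, x_query, k=3):
    """Stand-in learner (NOT a random forest): mean response of the k nearest rows."""
    preds = []
    for q in x_query:
        ranked = sorted(range(len(x_train)), key=lambda i: math.dist(x_train[i], q))
        nearest = ranked[:k]
        preds.append(sum(y_train[i] for i in nearest) / len(nearest))
    return preds

def iterative_impute(X, predict=knn_predict, max_iter=10):
    """Impute None entries of a numeric table X (list of rows) along the lines of Algorithm 1.

    Assumes every column has at least one observed value.
    """
    n, p = len(X), len(X[0])
    missing = [[X[i][j] is None for j in range(p)] for i in range(n)]
    # Step 1: initial guess by the column means of the observed entries.
    imp = [row[:] for row in X]
    for j in range(p):
        obs = [X[i][j] for i in range(n) if not missing[i][j]]
        mean = sum(obs) / len(obs)
        for i in range(n):
            if missing[i][j]:
                imp[i][j] = mean
    # Step 2: visit incomplete columns in order of increasing number of missing values.
    order = sorted(range(p), key=lambda j: sum(missing[i][j] for i in range(n)))
    order = [j for j in order if any(missing[i][j] for i in range(n))]
    prev_delta = math.inf
    for _ in range(max_iter):
        old = [row[:] for row in imp]
        for s in order:
            others = [j for j in range(p) if j != s]
            obs_rows = [i for i in range(n) if not missing[i][s]]
            mis_rows = [i for i in range(n) if missing[i][s]]
            x_obs = [[imp[i][j] for j in others] for i in obs_rows]
            y_obs = [imp[i][s] for i in obs_rows]
            x_mis = [[imp[i][j] for j in others] for i in mis_rows]
            # Fit on the observed part of column s, predict its missing part.
            for i, y in zip(mis_rows, predict(x_obs, y_obs, x_mis)):
                imp[i][s] = y
        # Normalised squared difference between successive imputed matrices.
        num = sum((imp[i][j] - old[i][j]) ** 2 for i in range(n) for j in range(p))
        den = sum(imp[i][j] ** 2 for i in range(n) for j in range(p))
        delta = num / den if den else 0.0
        if delta > prev_delta:
            return old  # assumption of this sketch: keep the previous imputation
        prev_delta = delta
    return imp
```

Calling `iterative_impute(X)` on a table with `None` entries returns a completed copy and leaves `X` itself unchanged. Any function with the same signature as `knn_predict` can be passed as `predict`, which is where a random forest would be plugged in.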



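The stopping criterion $\gamma$ is met as soon as the difference between the newly imputed data matrix
and the previous one increases for the first time with respect to both variable types, if present. Here, the
difference for the set of continuous variables $N$ is defined as

$$\Delta_N = \frac{\sum_{j \in N} \left( X_{\mathrm{new}}^{\mathrm{imp}} - X_{\mathrm{old}}^{\mathrm{imp}} \right)^2}{\sum_{j \in N} \left( X_{\mathrm{new}}^{\mathrm{imp}} \right)^2},$$

and for the set of categorical variables $F$ as

$$\Delta_F = \frac{\sum_{j \in F} \sum_{i=1}^{n} I_{X_{\mathrm{new}}^{\mathrm{imp}} \neq X_{\mathrm{old}}^{\mathrm{imp}}}}{\#NA},$$

where #NA is the number of missing values in the categorical variables.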
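As an illustration, the two differences and the stopping rule can be written as small Python functions. This is a sketch under assumptions: data matrices are lists of rows, the column index lists `cont_cols` and `cat_cols` are supplied by the caller, and all function names are hypothetical.

```python
def delta_continuous(new, old, cont_cols):
    """Sum of squared changes over the continuous columns, relative to the new values."""
    num = sum((rn[j] - ro[j]) ** 2 for rn, ro in zip(new, old) for j in cont_cols)
    den = sum(rn[j] ** 2 for rn in new for j in cont_cols)
    return num / den

def delta_categorical(new, old, cat_cols, n_missing):
    """Number of categorical entries that changed, divided by the number of missing values."""
    changed = sum(rn[j] != ro[j] for rn, ro in zip(new, old) for j in cat_cols)
    return changed / n_missing

def should_stop(delta_now, delta_prev):
    """True once the difference has increased for every variable type that is present."""
    return all(delta_now[t] > delta_prev[t] for t in delta_now)
```

The dictionaries passed to `should_stop` hold one entry per variable type that occurs in the data, so a purely continuous data set is judged on the continuous difference alone.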

After imputing the missing values the performance is assessed using the normalised root mean
squared error (NRMSE, Oba et al. (2003)) for the continuous variables which is defined by

$$\mathrm{NRMSE} = \sqrt{\frac{\mathrm{mean}\left( \left( X^{\mathrm{true}} - X^{\mathrm{imp}} \right)^2 \right)}{\mathrm{var}\left( X^{\mathrm{true}} \right)}},$$

where $X^{\mathrm{true}}$ is the complete data matrix and $X^{\mathrm{imp}}$ the imputed data matrix. We use
mean and var as short notation for empirical mean and variance computed over the continuous missing values
only. For categorical variables we use the proportion of falsely classified entries (PFC) over the
categorical missing values, $\Delta_F$. In both cases good performance leads to a value close to 0 and bad
performance to a value around 1.
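Both error measures are simple to compute once the true and imputed values at the missing positions have been collected into flat lists. The following Python sketch shows one way to do so; the function names are hypothetical, and the variance uses the divisor $n$, which is an assumption since the text only speaks of the empirical variance.

```python
import math

def nrmse(true_vals, imputed_vals):
    """Normalised root mean squared error over the continuous missing entries."""
    n = len(true_vals)
    mse = sum((t - x) ** 2 for t, x in zip(true_vals, imputed_vals)) / n
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n  # divisor n (assumption)
    return math.sqrt(mse / var_true)

def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries over the categorical missing entries."""
    wrong = sum(t != x for t, x in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)
```

With this variance convention, filling every missing entry with the mean of the true values gives an NRMSE of exactly 1, which matches the interpretation of values around 1 as bad performance.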

When an RF is fit to the observed part of a variable we also get an OOB error estimate for that
variable. After the stopping criterion $\gamma$ is met we average over the set of variables of the same
type to approximate the true imputation errors. We assess the performance of this estimation by comparing
the absolute difference between true imputation error and OOB imputation error estimate in all simulation
runs.
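The idea behind an OOB estimate can be illustrated with a small bagging example in Python. Each learner is trained on a bootstrap sample, and every observation is predicted only by the learners whose bootstrap sample did not contain it, so no separate test set is needed. This is a sketch under assumptions: the base learner is a one-dimensional nearest-neighbour lookup used as a stand-in for a regression tree, and the function name, `n_learners` and `seed` are hypothetical.

```python
import random

def oob_mse(x, y, n_learners=200, seed=0):
    """Out-of-bag mean squared error of a bagged 1-nearest-neighbour ensemble."""
    rng = random.Random(seed)
    n = len(x)
    sums = [0.0] * n    # running sum of OOB predictions per observation
    counts = [0] * n    # number of learners for which the observation was out of bag
    for _ in range(n_learners):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        in_bag = set(bag)
        for i in range(n):
            if i in in_bag:
                continue
            # Stand-in base learner: response of the nearest in-bag observation.
            nearest = min(bag, key=lambda b: abs(x[b] - x[i]))
            sums[i] += y[nearest]
            counts[i] += 1
    errs = [(sums[i] / counts[i] - y[i]) ** 2 for i in range(n) if counts[i] > 0]
    return sum(errs) / len(errs)
```

Because each observation is scored only by learners that never saw it, the resulting error behaves like a test-set error obtained from the training data alone.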

3 Methods 

We compare missForest with four methods on ten different data sets where we distinguish between 
situations with continuous variables only, categorical variables only and mixed variable types. 
The most well-known method for imputation of continuous data sets, especially in the field of gene
expression analysis, is the KNNimpute algorithm by Troyanskaya et al. (2001). A missing value in variable
$X_j$ is imputed by finding its $k$ nearest observed variables and taking a weighted mean of these $k$
variables for imputation. The weights depend on the distance to the variable $X_j$. The distance itself is
usually chosen to be the Euclidean distance.

When using KNNimpute the choice of the tuning parameter $k$ can have a large effect on the
performance of the imputation. However, this parameter is not known beforehand. Since our method includes
no such parameter we implement a cross-validation (see Algorithm 2) to obtain a suitable $k$.



Algorithm 2 Cross-validation KNN imputation.

Require: $X$ an $n \times p$ matrix, number of validation sets $l$, range of suitable number of nearest
neighbours $K$
 1: $X^{CV} \leftarrow$ initial imputation using mean imputation;
 2: for $t$ in $1, \ldots, l$ do
 3:     $X_{\mathrm{mis},t}^{CV} \leftarrow$ artificially introduce missing values to $X^{CV}$;
 4:     for $k$ in $K$ do
 5:         $X_{\mathrm{KNN},t}^{CV} \leftarrow$ KNN imputation of $X_{\mathrm{mis},t}^{CV}$ using $k$ nearest neighbours;
 6:         $\varepsilon_{k,t} \leftarrow$ error of KNN imputation for $k$ and $t$;
 7:     end for
 8: end for
 9: $k_{\mathrm{best}} \leftarrow \arg\min_k \frac{1}{l} \sum_{t=1}^{l} \varepsilon_{k,t}$;
10: $X^{\mathrm{imp}} \leftarrow$ KNN imputation of $X$ using $k_{\mathrm{best}}$ nearest neighbours.
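The selection loop of Algorithm 2 can be sketched in Python as follows. The sketch assumes a mean-imputed, complete matrix `X_complete` (step 1 is left to the caller) and takes the imputation routine as a parameter `impute(X_mis, k)`, so it does not depend on any particular KNN implementation. The names `choose_k`, `squared_error`, `frac` and `seed` are hypothetical, and hiding a fixed fraction of cells uniformly at random is an assumption about how missing values are introduced.

```python
import random

def squared_error(X_true, X_imp, hidden):
    """Sum of squared differences at the artificially hidden cells."""
    return sum((X_true[i][j] - X_imp[i][j]) ** 2 for i, j in hidden)

def choose_k(X_complete, candidates, n_sets, impute, error=squared_error, frac=0.1, seed=0):
    """Return the candidate k with the smallest average error over n_sets validation sets."""
    rng = random.Random(seed)
    n, p = len(X_complete), len(X_complete[0])
    cells = [(i, j) for i in range(n) for j in range(p)]
    total = {k: 0.0 for k in candidates}
    for _ in range(n_sets):
        # Artificially introduce missing values into a copy of the completed matrix.
        hidden = rng.sample(cells, max(1, int(frac * len(cells))))
        X_mis = [row[:] for row in X_complete]
        for i, j in hidden:
            X_mis[i][j] = None
        # Impute with every candidate k and accumulate the error at the hidden cells.
        for k in candidates:
            total[k] += error(X_complete, impute(X_mis, k), hidden)
    return min(candidates, key=lambda k: total[k] / n_sets)
```

The chosen value would then be used for one final KNN imputation of the original matrix, as in step 10 of the algorithm.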



In the original paper of Troyanskaya et al. (2001) the data was not standardized before applying the
KNNimpute algorithm. This constitutes no issue in the case of gene expression data because such data
generally consists of variables on similar scales. However, we are applying the KNNimpute algorithm
to data sets with varying scales in the variables. To avoid variance-based weighting of the variables we
scale them to unit standard deviation. We also center the variables at zero. After imputation the data is
retransformed such that the error is computed on the original scales. This last step is performed because
missForest does not need any transformation of the data and we want to compare the performance of the
methods on the original scales of the data.
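The standardization and the subsequent retransformation can be sketched in Python as below. This is an illustration under assumptions: missing entries are marked with `None`, column means and standard deviations are computed from the observed entries only, the standard deviation uses the divisor $n - 1$, and no column is constant. The function names are hypothetical.

```python
import math

def standardize(X):
    """Centre each column at 0 and scale it to unit standard deviation (observed entries only)."""
    p = len(X[0])
    means, sds = [], []
    for j in range(p):
        obs = [row[j] for row in X if row[j] is not None]
        m = sum(obs) / len(obs)
        sd = math.sqrt(sum((v - m) ** 2 for v in obs) / (len(obs) - 1))
        means.append(m)
        sds.append(sd)
    Z = [[None if v is None else (v - means[j]) / sds[j] for j, v in enumerate(row)]
         for row in X]
    return Z, means, sds

def retransform(Z, means, sds):
    """Map standardized (and possibly imputed) values back to the original scales."""
    return [[None if z is None else z * sds[j] + means[j] for j, z in enumerate(row)]
            for row in Z]
```

Keeping `means` and `sds` from the first step is what allows the imputed matrix to be mapped back, so that errors are reported on the original scales.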

Another approach for continuous data, especially in the case of high-dimensional normal data
matrices, is presented by Städler and Bühlmann (2010) using an EM-type algorithm. In their Missingness
Pattern Alternating Imputation and $\ell_1$-penalty (MissPALasso) algorithm the missing variables are
regressed on the observed ones using the lasso penalty by Tibshirani (1996). In the following E step the
obtained regression coefficients are used to partially update the latent distribution. The MissPALasso
also has a tuning parameter $\lambda$ for the penalty. As with KNNimpute we use cross-validation to tune
$\lambda$ (cf. Algorithm 2). When applying MissPALasso the data is standardized as regularization with a
single $\lambda$ requires the different regressions to be on the same scale.



In the comparative experiments with categorical or mixed-type variables we use the MICE algorithm
by Van Buuren and Oudshoorn (1999) based on the multivariate multiple imputation scheme of Schafer
(1997). In contrast to the latter the conditional distribution for the missing data in each incomplete
variable is specified in MICE, a feature called fully conditional specification by Van Buuren (2007).
However, the existence of a multivariate distribution from which the conditional distribution can be
easily derived is assumed. Furthermore, iterative Gibbs sampling from the conditional distributions can
generate draws from the multivariate distribution. We want to point out that MICE in its default setup
is not mainly intended for simple missing value imputation. Using the multiple imputation scheme
MICE allows for assessing the uncertainty of the imputed values. It includes features to pool multiple
imputations, choose individual sampling procedures and allows for passive imputation controlling the
sync of transformed variables. In our experiments we used MICE with either linear regression with
normal errors or mean imputation for continuous variables, logistic regression for binary variables and
polytomous logistic regression for categorical variables with more than two categories.

For comparison across different types of variables we apply the KNNimpute algorithm with dummy
coding for the categorical variables. This is done by coding a categorical variable $X_j$ into $m$
dichotomous variables $X_{j,m} \in \{-1, 1\}$. Application of the KNNimpute algorithm for categorical data
can be summarized as:

1. Code all categorical variables into $\{-1, 1\}$-dummy variables;

2. standardize all variables to mean 0 and standard deviation 1;

3. apply the cross-validated KNNimpute method from Algorithm 2;

4. retransform the imputed data matrix to the original scales;

5. code the dummy variables back to categorical variables;

6. compute the imputation error.
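Steps 1 and 5 of the list above can be sketched in Python. The encoding follows the $\{-1, 1\}$ scheme described in the text; the decoding rule, which picks the level whose dummy variable has the largest imputed value, is an assumption of this sketch since the text does not spell it out. The function names are hypothetical and missing entries are marked with `None`.

```python
def encode(values, levels):
    """Code each categorical value into len(levels) dummy variables in {-1, 1}."""
    rows = []
    for v in values:
        if v is None:
            rows.append([None] * len(levels))  # missing value: all dummies missing
        else:
            rows.append([1.0 if v == lev else -1.0 for lev in levels])
    return rows

def decode(rows, levels):
    """Code dummy rows back to levels by taking the largest dummy value (assumed rule)."""
    return [levels[max(range(len(levels)), key=lambda m: row[m])] for row in rows]
```

For a nucleotide variable with levels A, C, G and T, an imputed dummy row such as (0.2, -0.9, 0.7, -1.0) would be coded back to G under this rule.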

For each experiment we perform 50 independent simulations where 10%, 20% or 30% of the values
are removed completely at random. Each method is then applied and the NRMSE, the PFC or both are
computed (see Section 2). We perform a paired Wilcoxon test of the error rates of the compared methods
versus the error rates of missForest. In addition, the OOB error estimates of missForest are recorded in
each simulation.
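Removing values completely at random can be sketched in Python as follows. The sketch assumes that a fixed fraction of all cells of the data matrix is drawn uniformly without replacement and marked with `None`; the function name `remove_mcar` and the `seed` parameter are hypothetical.

```python
import random

def remove_mcar(X, fraction, seed=None):
    """Return a copy of X with a given fraction of cells set to None, plus the hidden cells."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    cells = [(i, j) for i in range(n) for j in range(p)]
    hidden = rng.sample(cells, round(fraction * len(cells)))
    X_mis = [row[:] for row in X]
    for i, j in hidden:
        X_mis[i][j] = None
    return X_mis, hidden
```

The returned list of hidden cells identifies where the true values are known, which is what the NRMSE and PFC are later computed on.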

4 Results 

4.1 Continuous variables only 

First, we focus on continuous data. We investigate the following four publicly available data sets: 

• Isoprenoid gene network in A. thaliana: This gene network includes p = 39 genes each with
n = 118 gene expression profiles corresponding to different experimental conditions. For more
details on this data set see Wille et al. (2004).

• Voice measures in Parkinson's patients: The data described by Little et al. (2008) contains a
range of biomedical voice measurements from 31 individuals, 23 with Parkinson's disease (PD).
There are p = 22 particular voice measurements and n = 195 voice recordings from these
individuals. The data set also contains a response variable giving the health status. Dealing only with
continuous variables the response was removed from the data. We will return to this later on.

• Shapes of musk molecules: This data set describes 92 molecules of which 47 are musks and
45 are non-musks. For each molecule p = 166 features describe its conformation, but since a
molecule can have many conformations due to rotating bonds, there are n = 476 different
low-energy conformations in the set. The classification into musk and non-musk molecules is removed.
Figure 1: Continuous data. Average NRMSE for KNNimpute (grey), MissPALasso (white) and
missForest (black) on four different data sets and three different amounts of missing values, i.e., 10%, 20%
and 30%. Standard errors are in the order of magnitude of $10^{-4}$. Significance levels for the paired
Wilcoxon tests in favour of missForest are encoded as "*" <0.05, "**" <0.01 and "***" <0.001. If the
average error of the compared method is smaller than that of missForest the significance level is encoded
by a hash (#) instead of an asterisk. In the lowermost data set results for MissPALasso are missing due
to the method's limited capability with regard to high dimensions.



• Insulin gene expression: This high-dimensional data set originates from an analysis by Wu et al.
(2007) of vastus lateralis muscle biopsies from three different types of patients following insulin
treatment. The three types are insulin-sensitive, insulin-resistant and diabetic patients. The
analysis involves p = 12,626 genes whose expression levels were measured from n = 110 muscle
biopsies. Due to computation time we only perform 10 simulations instead of 50.

Results are given in Figure 1. We can see that missForest performs well, sometimes reducing the
average NRMSE by up to 25% with respect to KNNimpute. In case of the musk molecules data the
reduction is even above 50%. The MissPALasso performs slightly better than missForest on the gene
expression data. However, there are no results for the MissPALasso in case of the Insulin data set
because the high dimension makes computation infeasible.

For continuous data the missForest algorithm typically reaches the stopping criterion quite fast,
needing about 5 iterations. The imputation takes about 10 times as long as performing the cross-validated
KNNimpute where $\{1, \ldots, 15\}$ is the set of possible numbers of neighbours. For the Insulin data set
an imputation takes on average 2 hours on a customary desktop computer.



4.2 Categorical variables only 

We also consider data sets with only categorical variables. Here, we use the MICE algorithm described
in Section 3 instead of the MissPALasso. We use a dummy implementation of the KNNimpute algorithm
to deal with categorical variables (see Section 3). We apply the methods to the following data sets:



• Cardiac single photon emission computed tomography (SPECT) images: Kurgan et al. (2001)
discuss this processed data set summarizing over 3000 2D SPECT images from n = 267 patients
in p = 22 binary feature patterns.
Figure 2: Categorical data. Average PFC for cross-validated KNNimpute (grey), MICE (white) and
missForest (black) on three different data sets and three different amounts of missing values, i.e., 10%,
20% and 30%. Standard errors are in the order of magnitude of $10^{-4}$. Significance levels for the paired
Wilcoxon tests in favour of missForest are encoded as "*" <0.05, "**" <0.01 and "***" <0.001.



• Promoter gene sequences in E. coli: The data set contains sequences found by Harley and
Reynolds (1987) for promoters and sequences found by Towell et al. (1990) for non-promoters
totalling n = 106. For each candidate a sequence of 57 base pairs was recorded. Each variable can
take one of four DNA nucleotides, i.e., adenine, thymine, guanine or cytosine. Another variable
distinguishes between promoter and non-promoter instances.

• Lymphography domain data: The observations were obtained from patients suffering from
cancer in the lymphatic part of the immune system. For each of the n = 148 lymphomas, p = 19 different
properties were recorded, mainly in a nominal fashion. There are nine binary variables. The rest of
the variables have three or more levels.

In Figure 2 we can see that missForest always imputes the missing values better than the compared
methods. In some cases, namely for the SPECT data, the decrease of PFC compared to MICE is up
to 60%. For the other data sets the decrease is less pronounced, ranging around 10-20%, but
there still is a decrease. The amount of missing values on the other hand seems to have only a minor
influence on the performance of all methods. Except for MICE on the SPECT data, error rates remain
almost constant, increasing only by 1-2%. We pointed out earlier that MICE is not primarily tailored for
imputation performance but offers additional possibilities of assessing uncertainty of the imputed values
due to the multiple imputation scheme. Still, the results using the cross-validated KNNimpute (see
Algorithm 2) on the dummy-coded categorical variables are surprising. The imputation for missForest
needs on average 5 times as long as a cross-validated imputation using KNNimpute.



4.3 Mixed-type variables 

In the following we investigate four data sets where the first one has already been introduced, i.e., the
musk molecules data including the categorical response yielding the classification. The other data sets are:

• Proteomics biomarkers for Gaucher's disease: Gaucher's disease is a rare inherited enzyme
deficiency. In this data set Smit et al. (2007) present protein arrays for biomarkers (p = 590) from
blood serum samples (n = 40). The binary response distinguishes between disease status.

• Gene finding over prediction (GFOP) peptide search: This data set comprises mass-spectrometric
measurements of n = 595 peptides from two shotgun proteomics experiments on the nematode
Caenorhabditis elegans. The collection of p = 18 biological, technical and analytical variables
had the aim of novel peptide detection in a search on an extended database using established gene
prediction methods.

• Children's Hospital data: This data set is the product of a systematic long-term review of children
with congenital heart defects after open-heart surgery. Next to defect and surgery related variables,
long-term psychological adjustment and health-related quality of life were also assessed. After
removing observations with missing values the data set consists of n = 55 patients and p = 124
variables of which 48 are continuous and 76 are categorical. For further details see Latal et al.
(2009).



The results of this comparison are given in Figure 3. We can see that missForest performs better than
the other two methods, again reducing imputation error in many cases by more than 50%. For the GFOP
data, KNNimpute has a slightly smaller NRMSE than missForest but makes twice as much error on the
categorical variables. Generally, with respect to the amount of missing values the NRMSE tends to have
a greater variability than the PFC which remains largely the same.

The imputation results for MICE on the Children's Hospital data have to be treated cautiously. Since 
this data set contains ill-distributed and nearly dependent variables, e.g., binary variables with very few 
observations in one category, the missingness pattern has a direct influence on the operability of the 
MICE implementation in the statistical software R. The imputation error illustrated in Figure 3 was
computed from 50 successful simulations by randomly generating missingness patterns, which did not 
include only complete cases or no complete cases at all within the categories of the variables. Therefore, 
the actual numbers of simulations were larger than 50 for all three missing value amounts. Furthermore, 
nearly dependent variables were removed after each introduction of missing values. This leads to an 
average of 7 removed variables in each simulation. Due to this ad-hoc manipulation for making the 
MICE implementation work, we do not report significance statements for the imputation error. 



4.4 Estimating imputation error 

In each experiment we get for each simulation run an OOB estimate for the imputation error. In Figure
4 the differences of true imputation error, $\mathrm{err}_{\mathrm{true}}$, and OOB error estimates,
$\mathrm{err}_{\mathrm{OOB}}$, are illustrated for the continuous and the categorical data sets. Also, the
mean of the true imputation error and the OOB error estimate over all simulations is depicted.

We can see that for the Isoprenoid and Musk data sets the OOB estimates are very accurate, only
differing from the true imputation error by a few percent. In case of the Parkinson's data set the OOB
estimates exhibit a lot more variability than in all other data sets. However, on average the estimation 
is comparably good. For the categorical data sets the estimation accuracy behaves similarly over all 
scenarios. The OOB estimates tend to underestimate the imputation error with increasing amount of 
missing values. Apparently, the absolute size of the imputation error seems to play a minor role in the 
accuracy of the OOB estimates which can be seen nicely when comparing the SPECT and the Promoter 
data. 



4.5 Computational efficiency 

We assess the computational cost of missForest by comparing the runtimes of imputation on the previous
data sets. Table 1 shows the runtimes in seconds of all methods on the analyzed data sets. We can see
that KNNimpute is by far the fastest method. However, missForest runs considerably faster than MICE
and the MissPALasso. In addition, applying missForest required neither antecedent standardization of
the data, nor laborious dummy coding of categorical variables, nor implementation of CV choices for
tuning parameters.
Figure 3: Mixed-type data. Average NRMSE (left bar) and PFC (right bar, shaded) for KNNimpute
(grey), MICE (white) and missForest (black) on four different data sets and three different amounts
of missing values, i.e., 10%, 20% and 30%. Standard errors are in the order of magnitude of $10^{-4}$.
Significance levels for the paired Wilcoxon tests in favour of missForest are encoded as "*" <0.05, "**"
<0.01 and "***" <0.001. If the average error of the compared method is smaller than that of missForest
the significance level is encoded by a hash (#) instead of an asterisk. Note that, due to ill-distribution
and near dependence in the Children's Hospital data, the results for MICE have to be treated with caution
(see Section 4.3).



There are two possible ways to speed up computation. The first one is to reduce the number of
trees grown in each forest. In all comparative studies the number of trees was set to 100 which offers
high precision but increased runtime. In Table 2 we can see that changing the number of trees in the
forest has a stagnating influence on imputation error, but a strong influence on computation time which
is approximately linear in the number of trees.

The second one is to reduce the number of variables randomly selected at each node ($m_{\mathrm{try}}$) to
set up the split. Table 2 shows that increasing $m_{\mathrm{try}}$ has limited effect on imputation error,
but computation time is strongly increased. Note that for $m_{\mathrm{try}} = 1$ we no longer have a random
forest, since there is no more choice between variables to split on. This leads to a much higher imputation
error, especially for the cases with low numbers of bootstrapped trees. We use for all experiments
$\lfloor \sqrt{p} \rfloor$ as default value, e.g., in the GFOP data this equals 4.
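The default value is a one-line computation. The following Python sketch shows it with integer arithmetic; the function name `default_mtry` is hypothetical.

```python
import math

def default_mtry(p):
    """Largest integer not exceeding the square root of the number of variables p."""
    return math.isqrt(p)

# Number of variables p of some of the analyzed data sets, taken from Table 1.
for name, p in [("Isoprenoid", 39), ("Parkinson's", 22), ("Musk (cont.)", 166), ("GFOP", 18)]:
    print(name, default_mtry(p))
```

Using `math.isqrt` avoids floating-point rounding for large $p$, which matters little here but keeps the floor exact.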



5 Conclusion 

Our new algorithm, missForest, allows for missing value imputation on basically any kind of data. In
particular, it can handle multivariate data consisting of continuous and categorical variables
simultaneously. MissForest has no need for tuning parameters, nor does it require assumptions about
distributional aspects of the data. We show on several real data sets coming from different biological and
medical fields that missForest outperforms established imputation methods like k-nearest neighbours
imputation or multivariate imputation using chained equations. Using our OOB imputation error estimates,
missForest offers a way to assess the quality of an imputation without the need for setting aside test data
or performing laborious cross-validations. For subsequent analysis these error estimates represent a means
Figure 4: Difference of true imputation error $\mathrm{err}_{\mathrm{true}}$ and OOB imputation error
estimate $\mathrm{err}_{\mathrm{OOB}}$ for the continuous data sets (left) and the categorical data sets
(right) and three different amounts of missing values, i.e., 0.1, 0.2 and 0.3. In each case the average
$\mathrm{err}_{\mathrm{true}}$ (circle) and the average $\mathrm{err}_{\mathrm{OOB}}$ (plus) over
all simulations is given.



Data set          n       p     KNN   MissPALasso   MICE   missForest
Isoprenoid      118      39     0.8           170      -          5.8
Parkinson's     195      22     0.7           120      -          6.1
Musk (cont.)    476     166      13          1400      -          250
Insulin         110   12626    1800           n/a      -         6200
SPECT           267      22     1.3             -     37          5.5
Promoter        106      57      14             -   4400           38
Lymphography    148      19     1.1             -     93          7.0
Musk (mixed)    476     167      27             -   2800          500
Gaucher's        40     590     1.3             -    130           29
GFOP            595      18     2.7             -   1400           40
Children         55     124     2.7             -   4000          110

Table 1: Average runtimes [s] for imputing the analyzed data sets. Runtimes are averaged over the 
amount of missing values since this has a negligible effect on computing time. 



of informal reliability check for each variable. The full potential of missForest is deployed when the
data includes complex interactions or nonlinear relations between variables of unequal scales and
different type. Furthermore, missForest can be applied to high-dimensional data sets where the number of
variables greatly exceeds the number of observations and still provides excellent
imputation results.
mtry \ ntree          10          50         100         250         500

 1             36.8/35.5   27.4/32.3   20.4/31.3   17.2/30.0   16.0/30.8
                    2.5s        3.2s        3.9s        5.8s        9.2s

 2             34.9/31.8   24.8/29.2   18.3/28.8   16.0/28.6   15.5/29.1
                    6.9s       11.8s       15.0s       25.2s       39.3s

 4             34.9/31.3               17.9/26.2   15.4/28.2
                   16.5s       25.1s       35.0s       49.0s       83.3s

 8             34.7/31.4   24.3/28.9   18.1/27.8   15.2/27.8   15.7/28.6
                   39.2s       57.4s       84.4s      130.2s      190.8s

16             34.6/30.9   24.3/28.7   18.1/28.0   15.4/27.8   15.6/28.5
                   68.7s       99.7s      172.2s      237.6s      400.7s

Table 2: Average imputation error (NRMSE/PFC in percent) and runtime (in seconds) with different 
numbers of trees (ntree) grown in each forest and variables tried (mtry) at each node of the trees. Here, 
we consider the GFOP data set with artificially introduced 10% of missing values. For each compari- 
son 50 simulation runs were performed using always the same missing value matrix for all numbers of 
trees/randomly selected variables for a single simulation. 